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Abstract 

Gaussian process is a theoretically appealing model for nonparametric analysis, 
but its computational cumbersomeness hinders its use in large scale and the existing 
reduced-rank solutions are usually heuristic. In this work, we propose a novel con¬ 
struction of Gaussian process as a projection from fixed discrete frequencies to any 
continuous location. This leads to a valid stochastic process that has a theoretic sup¬ 
port with the reduced rank in the spectral density, as well as a high-speed computing 
algorithm. Our method provides accurate estimates for the covariance parameters 
and concise form of predictive distribution for spatial prediction. For non-stationary 
data, we adopt the mixture framework with a customized spectral dependency struc¬ 
ture. This enables clustering based on local stationarity, while maintains the joint 
Gaussianness. Our work is directly applicable in solving some of the challenges in the 
spatial data, such as large scale computation, anisotropic covariance, spatio-temporal 
modeling, etc. We illustrate the uses of the model via simulations and an application 
on a massive dataset. 
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1 Introduction 


Gaussian process has a wide range of applications from computer experiment emulations 

The covariance is the 


Toeppky et al. 

2(1(1!) 

) to spatial data analysis ( 

Gressie 

2015 


crucial component but it involves two crucial challenges: its evaluation in the likelihood is 
prohibitively cumbersome and the consideration for non-stationarity is difficult. 

Firstly, to address the computational issue, various reduced-rank approaches have been 


proposed: Nystrom method uses truncation to the top m eigenvectors in the matrix (Smola 


and Schlkopf| 2000); the Gaussian predictive process (Banerjee et ah, 2008) models the full 


observation as the prediction mean from a small set (with size m) of the knot process; the 
spatial random effects model (Gressie and Johannesson, 2008) treats the covariance as a 
quadratic transform of m predetermined basis functions. With small m, these methods 
reduces the computational burden and makes the Gaussian process fitting viable in large 
data. However, the small m usually incurs sacrifices in resolution and its choice is commonly 
heuristic. 

In spectral domain, some alternative approaches have been proposed to overcome the 
computational obstacle. This pioneering work was Whittle’s likelihood (Whittle, 1953) in 
time series analysis, where the covariance is approximated by a discrete Fourier transform of 
the spectral density, if the data are regularly spaced. Multiple research methods have been 
proposed to accommodate the irregularity in the lattice: Fuentes (2007) uses lattice binning 
and mean Filing to minimize the effects of irregularity; Stroud et ah (2014) uses lattice 


embedding and Bayesian latent framework to estimate the incomplete lattice data; Xu et ah 


(2015) models the randomly located data as realization of a Gaussian Markov random field 
conditional on the lattice. These methods have very appealing computational advantage as 
the likelihood evaluation is 0 {n\og 2 n) (faster than the reduced rank methods) and do not 
involve resolution reduction. On the other hand, there is not much theoretic development 
on the relaxed assumption about the lattice. Since conditioning on any arbitrary points off 
the data lattice would involve changes in the data lattice and the matrix decomposition, 
hence the likelihood does not lead to a valid stochastic process in the continuous data 
space; likewise, prediction formulation (Kriging (Gressie, 2015)) is much restrictive as it 
only projects to the lattice points. 

Secondly, non-stationarity is very common in real-world data collected over large space. 
A general strategy is letting the covariance function (or the spectral density) vary with 


locations, as has been studied by Paciorek and Schervish (2006) and Anderes and Stein 


(2011). In this regard, Priestley (1965) proposed the idea of “semi-stationary process” 


which matured to “locally stationary process” ( 

Dahlhaus, 

2000 

) with efficient algorithms 

(Guinness and Stein 

2013 

Guinness and Fuentes, 

2015 

) to partition the full domain into 


small stationary regions. For this spatial partitioning, an inhnite mixture process pro¬ 
vides more theoretically sound support from a probabilistic point of view: given the latent 
class assignment (also known as “clustering”), the local stationarity is achieved inside each 
class. The work in this area includes a spatially varying weight and stationary Gaussian 


process component (Duan et ah, 2007; Rodriguez et al.| 2010 Rodriguez and Dunsonj 


2011). Nevertheless, it is important to point out the discrepancy between the locally sta¬ 


tionary process and the mixture construction: in the former, the whole random vector is 
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still correlated across different stationary regions; in the latter, it is assnmed independent 
given the different latent class assignment (i.e. after the data are clustered). In the mix¬ 
ture distribution case, to represent the correlation on the full domain, one has to use the 
marginal distribution, which is no longer Gaussian. 

In this paper, we propose a new spectral construction of the Gaussian process. Instead 
of constraining the data to a lattice, we only £x the frequencies to a hnite support set. 
This allow us to construct a valid Gaussian process as a noisy projection from a hnite 
spectral set to any arbitrary and uncountable set in d-dimensional data space The 
duality in frequency lattice and continuous space enables us to take advantages in both the 
fast sampling algorithm and the common tools such as Kriging. Due to the sparsity in the 
frequency support, the rank of the matrix can be automatically reduced and the computa¬ 
tion is further accelerated to 0{m\og2n). As for the non-stationary consideration, using 
different way of projection from the shared frequency random vector leads to a valid non- 
stationary Gaussian covariance. We customize the mixture framework with this spectral 
dependency and show that the distribution is still multivariate Gaussian, even conditional 
on the different cluster labeling. We illustrate the advantages of the proposed work with 
simulations and a large spatio-temporal application with 1,343,752 data entries. 


2 Functional Gaussian Process 


2.1 Spectral Construction 


As shown by Priestley (1965) and later by Higdon (1998), a weakly stationary Gaussian 
process can be represented as Zs = exp(is^ uj)g^^‘'^(uj)Y(uj)duj + (s- In this formu¬ 

lation, V(u>) ~ GAt(0,I) is an orthogonal complex normal process folded over 0, that is, 

Y(cj) = Yi(cL))+iY 2 (cL>) with Yi(cl>), Y2(cl)) N{0, 1 ) with constraint that Y(uj) = Y{—oo). 

We added the remaining ~ At(0,(j^) to ensure the full rank in the hnite dimensional 
density. By Bochner’s theorem. The function g{u}) = is a positive and real value 

function, which can be represented as the forward Fourier transform of the covariance 
function. 

We now give a similar construction but using discrete representation: 


Z. = 


exp{is^uJi)g^^‘^{u;i)Y{u;i)/y/n + 


( 1 ) 




where each u^i represents a d-dimensional coordinate from a Gartesian product set of 

f rni/2 A mi/2-1 A !ZnA A A X X f mrf/2-1 a 'HYl A A whprp m r ^ < 

n(,), m = 77117712...md is the total number of coordinates. They have fixed increment and 
symmetry about 0. We assume the mild condition that g{oj) ~ 0 if falls outside of the 
region W = (—^Ai, ^Ai) x ... x (—^A^, ^A^). Formally, in each sub-dimension of 
LO, for any arbitrarily small e > 0 there is an n;o(e) such that if |n;| > |Li;o(f )|5 
This condition is satisfied by most of the spectral density functions, since the detectable 
frequencies is always bounded by the sampling rate (e.g. the minimum distance). And 
commonly used covariance family, such as Matern, has the spectral density g{p,u) as a 
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decreasing function in both the range parameter p and frequency oj: when p is large, g 
declines rapidly in u] when p is small, we can scale up the location unit so that p increases 
accordingly; when p is extremely small, the correlation is simply negligible and not of 
interests. In all cases, we can have a valid condition for this truncation. 

Using matrix representation of row vector Qg = and diagonal matrix 

G = the construction of ([^ can be viewed as the result of a projection from 

an m-element vector Y in frequency space to an n-element location space QgG^/^Y and 
adding random noise. We refer the set as the frequency support. 

For an hnite set of arbitrary locations S = {sj}j=i..jv with Sj G M'^, it can be derived 
that the joint distribution is: 

Zs~N(0,QsGQ^ + Ia2) (2) 

where Qs is the n-by-m matrix formed by stacking the row vectors Qs- and Qg is its 
conjugate transpose. It can be derived that the covariance function between any two 
locations Sj, is 


Cov{sj,Sk) = expjiuj (sj - Sk)}g{u)i)/n + a‘^lj=k (3) 

i=i 

We would like to point out that (|^ forms a valid covariance matrix, as stated by the 
theorem below. For the conciseness of presentation, all the proofs will be shown in the 
appendix. 

Theorem 1. The covariance matrix formed by is real and positive definite. 

It can also be observed that the function of Q is shift-invariant in (s^, s^), therefore Zs 
is weakly stationary. In fact, any weakly stationary Gaussian distribution can be viewed 
as the limit of formulation in ([^, due to the following property: 

Theorem 2. If we denote a weakly stationary covariance function by C{sj — s^), and its 
spectral density as g{cij) = ^.^^exp{—ix’"(jj)C{x)d^. Under mild condition that g{uj ) < e */ 
(jL> ^ W, when m = goes to infinity, the function specified by converges to the 

covariance function, that is: 


ItnClm^ooC Ov{Sj, Sjfi C{Sj ^k') 

with the rate of convergence being 0(l/m^). 

This property is important as under the case of relative large n(,), the spectral construc¬ 
tion in ([^ can be treated as an approximation to a weakly-stationary covariance. However, 
it would not be fair to view this simply as an approximation method, in fact, when the 
frequency support set {n;;} is hxed, the distribution specihed in ([^ can be extended to 
uncountable set of locations and form a valid stochastic process. 

Theorem 3. With the freguency support set fixed, the finite dimensional distri¬ 

bution of Z specified by ^ satisfies the Kolmogorov consistency criteria: 
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1 . For any finite set {si,S 2 , inMf and all of its permutation 8 ^, 2 , ■■■y^Trn}, we 

have 8 ,^ 2 ) ■■•5 8 ,^^) p(si, S 2 ,..., 8 jj) 

2. For any location in Sk G we have p{si,S 2 , ...,Sn) = J^aP{si,S 2 , ...,Sn,Sk)dsk 

Therefore, Zjjd is a valid stochastic process. We name this process as the functional 
Gaussian process (FGP). 

2.2 Spectral Dimension Reduction 

We have established the asymptotic equivalence of the functional Gaussian process and the 
one constructed by the traditional covariance functions. Now we demonstrate its unique 
advantage in computation. 

We hrst note the properties of the n-by-m matrix Qs under two conditions: 

Theorem 4. If the location vector has S = ni} x n 2 } x ... x {!,...,nd}, and 

the frequency support {coi}i has A() = vr and m(.) < nf), then we have QgQs = I- When 
m{.) = n{.), QsQs = I- 

This is very important as it greatly reduces the computational burden in the likelihood, 
with two steps of truncations. We hrst truncate the frequency support to (—vr, vr)'^ (that is 
A(.) = 71 and m(.) = n{.)), after ensuring g{u}) < e when |a;(.)| > vr in all the directions, for 
arbitrarily small e. If this condition were not met (although unlikely), we can manually scale 
up in S and thereby increase the range parameter p, which leads to more rapid decrease of 
as discussed in the previous section. 

As the second step, we further truncate the rank of Qs down to m(.) < n(.), by 
eliminating the frequencies with g{u:) cr^. For illustration, we denote the full rank 
matrix with subscript n and reduced rank with m. Since we have g{(jj)/{g{(jj) + cr^} = 
o{g{u:)) for the omitted {n — m) frequencies, using Woodbury identity we have (QnG„Q* + 
- a-2Q„G„(G„ + I„a2)-iQ; ^ - a-2Q^G^(G™ + 

and IQjiG^Q*+I(j^| = |G„ + I„(T^| ~ |Gm + ImO'^||I(n-m)C’'^|. Our empirical hnding is that 
truncation at ^(n;) > 0.01 ■ results in indistinguishable parameter estimation. More will 
be discussed in the simulation studies. 


2.3 Bayesian Modeling and Posterior Computation 


Similar to other spectral methods (e.g. Stroud et ah (2014)), we utilize lattice embedding 
to obtain the frequency estimates. Before we elaborate the method, it is worth pointing out 
that the lattice latent variable is only an auxiliary tool for posterior estimation, provided 
a hnite set of data are collected. This does not contradict the dehnition of functional 
Gaussian process on any continuous domain. In fact, unlike other lattice method, we can 
have even multiple observations on the same location and our construction is still valid. 

We now assume the data Zg are originally observed on n coordinates S. We also 
assume there is a latent random variable Zg on the lattice S = {l,...,? 7 ,i} x {!,...,n 2 } x 
... X {1, ...,77,^}. To satisfy the two conditions, we divide S by the greatest common divisor 
in each direction of the {sj — s*,} and shift the smallest coordinate to (1,1,..., 1). We denote 
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the transformed set as So, so we have So C S. As it is common to relax the homogeneity 
assnmption abont the random noise, we added a diagonal matrix with location specihc 
random noise Vs^. We have: 


Zsol^s ~ N(Zs„, Vs„) 

Zs~N(0,QsGQs* + Ict2) 

where Vs„ = {z/j} is a positive diagonal matrix. 

The angmented likelihood becomes a prodnct of m + h independent normal density: 


(4) 


p(Zso,Zs) oc + + 


i=i 


TT ( ^ ~ ^SoiY , , ( 2\^ 

YYexpi - -{ --2- + log[iyi)} 


(5) 


2=1 


where (Q*Z); denotes the /th element Q*Z. We wonld like to point ont the famons Whit¬ 
tle’s likelihood (Whittle, 1953) is a special case of the hrst step trnncation to (—7r,7r)'’* in 
dimension n on a fnll lattice. Here we present a more general case with farther dimen¬ 
sion redaction from n to m. The prodnct of Q*Z corresponds to the m-element trnncated 
inverse Fonrier transform of Z. The fnll compntation can be carried ont at a complexity 


of 0(nlog2ra) nsing Fast Fonrier Transform (Cooley and Tnkey (1965)) or even faster at 


0(mlog2n) with the recently invented Sparse Fonrier Transform (Hassanieh et ah, 2012). 


For posterior sampling, we nse Gibbs sampling from the individnal fnll conditional dis- 
tribntion. However, it is compntationally demanding to g enerate random sam ples from 


ZsiZc 


In a similar stndy of stationary Ganssian process. 


Strond et ah 


( 2014[ ) proposed 

using a fc-iteration solver at each step with a complexity at 0 {kn\og 2 ^)- Here we in¬ 
troduce a more efficient sampling scheme at 0 {n\og 2 n) with the latent lattice variable 
pL = QsG^/^Y followed by Zs ~ N(^, Icr^). 

We denote the covariance parameter in G as 0 and its prior distribution as p{0). As 
the covariance function has the general form of C{x\0) = 6 ih{x/ 62 ), to assure posterior 
propriety, we assign proper diffuse prior, /G(0.1, 0.1) for the scale parameter 6 i and uniform 
prior U{ 0 , 1000) for the covariance parameter 62 - If a stricter condition can be met such 
that Ui = Uj for any the objective improper prior such as Berger et ah ( 2001[ ) is more 
desirable. The full conditional distributions can be derived: 


Y|Zs ~ CN{G^/^{G + a^{G + la^)’^), 

M =QsG1/2Y, 

p{e\z) (xp{z\o)p{0), 

Zilft.Zi Af((i + + §). (4 + if * e S„, 


( 6 ) 


' cr^ ' u?' 


' cr^ ' z/p' 
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2.4 Predictive Distribution 


It is a common use of Gaussian process to make prediction at a set of arbitrary locations, 
which is known as Kriging (Cressie, 1988). The functional Gaussian process also accom¬ 


modates this demand, since the covariance matrix between two sets Si and S 2 is simply 
Qs^GQg^. The predictive distribution is Zg^lZsi ~ N(Qs 2 GQsi*(QsiGQsi + + 

Qs.GQs/ + + Vs, - Qs,GQs,*(Qs,GQ^^ + + VsJ-'Qs.GQs/). 

To facilitate the computation, when conditioning on the latent variable Y in the frequency 
space, the predictive distribution can be further simplified to: 


Zs,|Y~N(Qs,Gi/2Y,V + Ia^), 
which is very efficient due to the independent condition. 


(7) 


2.5 Misaligned Model for Lower Resolution Analysis 

It is worth noting that, the dimension reduction in our method is a result of the inherent 
sparsity in the frequencies, as opposed to the approximation at the lower resolution. In 
fact, all the latent variables in the location space are modeled at the same resolution as 
the data themselves. Therefore, FGP provides estimation directly on the hnest resolution. 
Thanks to the high computational efficiency, it is easy to set up latent variables on a large 
lattice without costing much, since the time demand only increases linearly. 

On the other hand, there are scenarios where one may be interested in a lower resolution 
analysis. For examples, some observations might be very close and there is little gain to 
measure their correlation (since it is close to 1); some coordinates may contain measurement 
errors and therefore perfectly assigning them to a dense lattice is unnecessary; there might 
be a need for a even more real-time computation. 

In these cases, we propose the approximation model for FGP. Assume a collection 
of random variables from a functional Gaussian process are located on a lattice S, we 
collect data at x* near the lattice point. To assign one lattice point for one Xj, we dehne 
s* = minsMrgmins.^sjjxi — Sj||. As a result, each lattice point sj may have multiple data 
assigned, we denote the set as Xs- = {ah Xj s.t. s* = s^} 

To model these data, we simply treat them as misaligned: we assign the value on the 
closest lattice point Sj as the mean and an increasing function in the distance ||xj — Sj|| as 
the variance. That is. 


Zs~N(0,QsGQ^ + Ia2), 

ZJZ,. iV(Z,., A(||xi - SjlD) for all x; e Xsj- 


One example of such function is A(||xi — Sj||)) = z/f |xi — Sj||), for which if ||xi — Sj|| = 
0, this reduces to the formulation in (|^. Therefore, the misaligned model is a generalized 
case of (|^, the properties and sampling algorithm of functional Gaussian process stills apply 

in this scenario. The only modification is the posterior ZgJjUgj, {Z:^.}xiGxB- -^((^ + 
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3 Non-stationary Functional Gaussian Process 


We now propose a more general framework that gives arise to a non-stationary Gaussian 
process. Our work is largely inspired by the pioneer work in evolutionary spectrum and semi 
or locally stationary process (Priestley (1965)). The former is to let the spectral process 
change with location Hs = A{s,u})Y{lo), where Y{u}) ~ GW(0,1) for each a;. As the 
modulating function y4(s,ci;) changes with location s, the process and its Fourier transform 
Zs = become non-stationary. The latter is dehned in the sense 

that A(s,li;) is slowly varying with s such that within a small region, the process can be 
treated as stationary. We now combine this notion with the recently popularized mixture 
modeling framework to construct a new non-stationary process. 


3.1 A Stick-Breaking Process with Spectral Dependency 


(9) 




We hrst construct a non-stationary spectral process Hs{u}) on frequency lo G At 
location s: 

Ws(u;) = Tk{u;) with probability 

Tk{^) = 

Pk^s "^k^si J^(l 
j<k 

where gk{^) is the spectral density function of the /cth certain class; Y{u}) ~ GW(0,I) 
is a complex normal vector folded over 0, as dehned before; pk^s is the stick-breaking 
weight that varies in s. Similar to Priestley (1965), the non-stationarity is realized via on 
different modulating functions A(s,li;) = (lj); the distinction is that this function is 

now regulated via a stick-breaking process. 

The uniqueness of this stick-breaking process is that there is only one copy of Y{u:), 
shared by hnite or inhnite many components. This leads to dependent complex normal 
distribution for Tk^{u ))~ CN{Q,g]/^g]J^) for u; > 0, even if ki ^ k 2 - 


3.2 Non-stationary Functional Gaussian Process 

Similar to stationary functional Gaussian process, given the hnite hxed frequency support 
u} G W, we dehne the process in the continuous data domain: 

Zs = QsHs + ^s, where ~ A^(0, a^), (10) 

where Hg is dehned in (|^. The distributional diherences of Hg is controlled by the location 
varying pk^s- Marginalized over p, the covariance between two locations is Cov{Zg^, Zg^) = 
Qsi(E^=iEfe=iPfci,siPfc 2 ,s 2 Gfc(^Gfcf)Q*^ + More importantly, conditional on 

the latent class assignment Cg, all the observations on S are correlated and jointly form a 
multivariate Gaussian distribution with mean 0 and covariance: 

Cov(Z.„Z.,\C„,C.,) = (11) 

We again verify the requirements stated in the following theorem. 
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Theorem 5. For finite set S = {si, S 2 ,s„}, The covariance function in (11) generates 
a positive-definite matrix. The finite dimensional density of Zs satisfies the Kolmogorov 
consistency criteria and therefore can extend to a valid stochastic process Z^d . 


Therefore, we refer (10) as non-stationary functional Gaussian process (NS-FGP). The 


conditional dependency in NS-FGP is inherited from the shared spectral vector Y, despite 
of the different projections into the location space. This is a major distinguishing factor 


of our method from the other stick-breaking constructions like Duan et al. (2007), where 

= 0 if a, ^ a,. 

There are different choices to induce location varying p, such as hidden Markov random 



model is especially appealing for two reasons: hrst, the moderately large magnitude of /i 
(e.g. \fi\ > 3|) in the probit link p = can generate p close to 0 or 1, hence much less 
randomness in C and a clearer clustering pattern; second, this framework can be easily 
extended with our stationary functional Gaussian process to have extremely fast sampling 
speed. We describe the model for p as follows: 


Pk,s J_ J_(^ 

j<k 

^k,s P^i,d-'k,s ^ 0 )) 

Lfc,s ~ N(0, QsMfcQg -I-1), 


( 12 ) 


where is assumed to be from an FGP and is a diagonal matrix formed by a certain 
spectral density function. 


3.3 Posterior Computation 

We use the following data augmentation scheme to facilitate the posterior sampling for 
NS-FGP. To allow for more flexible consideration, we again relax the assumption about the 
homogeneous random error: 


Zs\{Cs = k} = Zs,k + ,where ~ 77(0, 

Cs = k w.p. 

Zs,fc ~ N(^s ;.,Icrfc), 

Ms* = QsGf Y, 

Uk,s ^ 0 ), 

Lfc,s ~ N(0, QsMfcQg -I-1). 

Then the augmented likelihood-prior probability is: 


7VoW(ZsjZs,Cs) X FGP(ZslCs) x ^P(Cs|Lk,s) x PGP(Lk,s) x Prw(0G,0M, 


(14) 
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where Normal is the density for independent normal distribution, FGP is the one for 
stationary functional Gaussian process, SB is the one for stick-breaking process. Similar 
to the stationary case, we assign all the covariance parameters as f/(0,1000) and all the 
scale parameters as /G(0.1,0.1). The full conditional distributions of the two FGPs are 
mutually independent with computational complexity 0{m\og2n). We list the sampling 
algorithm as follows: 

1. Draw Gs; from {1,2,...} from 

p(Gsi =k) (X. l(wfc,si > r-si)(Zg.IW,Si, 

2. Sample fromp(rs;) = Uniform{^,pcs.,Si) ■ 

3. Sample fromp(Lfc,Si) = A^orma/1)1 (hfc,si > 0,Gsi = /c)-FiVorma/(?7fc,Si, l)l(hfc,si < 

0, Gs; > k). And p{Lk^Si) = Normal 1) if Gg; < k or Gg; is unobserved. 

4. Sample from p{0M,k) oc FGP(Lfc,g.|0M,fc)vr(0Ar,fc)- 

5. Sample fromp(? 7 fc g) = MVN{QsMkiMk + I)^^QsLk,s, Qs(Mfc - Mfc(Mfc -h I)“^Mfc)Qs). 

6. Gompute Uk,si = ^{Vk,si) and Wk,si = Uk,si nfc<Kl “ “fc,si)- 

7. Sample from p{6z,k) oc GGP(Z{S:c,=k}|^z,fc)7r(0z,fc)- 

8. Sample from p{Y) = CiV((I + Er Gro,-^)-‘(Er "G^Q* Yr), (I + Er Grrr^:=)->). 

9. Gompute = QsG],^^Y . 

10. Sample from p{Zs,k) = Normal{{\ + 4)"^(4 4), (4 + 4)"^) if Gg = fc; else 

p(ys,k) = Normal{ps,k,(^l)- 


Similar to stationary FGP, the predictive distribution for NS-FGP is Zg^lZsijGsa ~ 
N(Qs2Gs2GsiQsi*(QsiGQg^-hIcr^-hVsJ”^Zsi, Qs2GQs2*+Ic^^+A^S2~Qs2GQsi*(QsiGQ 
Icr^ -|- Vsi)”^QsiGQs 2 *), or simply conditional on the spectral vector ZsajY, Gsa ~ 
N(Qg,G^fY,V + Ia2 ),where the latent Gs^ = Gk with probability ps 2 ,k- 


* -\- 
Si + 


4 Data Applications 

We now demonstrate the use of functional Gaussian process via simulated and real data 
applications. 

4.1 Simulated Data 

We hrst assess the performance of parameter estimation in stationary FGP. We generated 
2,500 locations randomly inside a square space Sj = {xii,Xi 2 ) ~ G(0,100) x G(0,100) (Fig- 
ur^^a)). As most of the existing spatial packages assume isotropic covariance, for compar¬ 
ison, we hrst tested the two isotropic functions: (1) Matern function with the smooth degree 
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at K = 1.5, C{si — Sj) = 0(1 + 


■]\/pi)exp{- 


i||/pi); (2) squared exponential 


function or sometimes called “Gaussian” covariance, C{si — Sj) = (j)exp{—\\si — Sj 


which can be viewed as Matern function with the smooth degree at n = oo (Stein 


V2p?), 

T^. 


We would like to point out that the degree of smoothness k is associated with the degree 
of sparsity in the spectral density matrix G in FGP. In d-dimension, the spectral density 


of Matern family is g{(jj) oc + 


\U} 


\2\—K—d/2 
\d) 


Therefore, the spectral density with larger 


value of K will approach 0 faster. As illustrated in Figured), the squared exponential is 
much closer to 0 (hence much sparser) than Matern k = 1.5 as |a;| increases. Using our 
empirical truncation criterion g{(jj) > 0.01 ■ we found that in Matern k = 1.5, pi = 5 
did not lead to any truncation at all; whereas in squared exponential, the spectral density 
is truncated at m = 13%n, which leads to almost 8 times complexity reduction. 

We compare the results against various spatial methods available in R, such as the 
maximum likelihood method (“geoR”), traditional Bayesian Gaussian process (“spBayes”) 
and predictive process (“spBayes”). The results are listed in Table 

For Matern k = 1.5 , FGP provides the most accurate estimate for the parameters and is 
the fastest method among all, thanks to its spectral algorithm. Gaussian predictive process 
(GPP) (Banerjee et ah, 2008) is also very efficient in computation due to its dimension 
reduction, but its estimate for the range parameter pi deviates from the true value. This 
result is consistent with the recent work of Datta et ah (2015), whose nearest-neighbor 
Gaussian process method showed great improvement over predictive process in parameter 
estimation. 

For squared exponential function, we see a computational time reduction in GPP, due 
to the lower evaluation cost of the covariance function. On the other hand, the sparsity 
in the spectral density suggests that the covariance function is ill-conditioned for matrix 
inversion. This severely affected parameter estimates in other methods: especially in the 
last two Bayesian methods, where we had to use smaller upper bound in the uniform prior 
of p to avoid the the matrix singularity. Nevertheless, FGP is not susceptible to this issue, 
since no matrix inversion is involved. The sparse values in g enables us to test the reduced 
dimension FGP. As shown in the Table the results show almost no difference, while the 
computation time is reduced by 6 times. 



FGP (m=n) 

FGP (m «C n) 

MLE 

Full Bayes GP 

GPP (64 knots) 

Matern with k = 1.5 pi = 5 

5.01(0.10) 


4.78(0.23) 

4.85(0.50) 

7.46(0.47) 

0 = 100 

98.00(16.50) 


86.56(10.54) 

82.33(26.67) 

127.95(6.71) 

cr^ ^ 1 

0.60(0.21) 


1.1(0.51) 

0.93(0.10) 

1.16(0.84) 

Time 

277 secs 


547 secs 

31359 secs 

605 secs 

Squared Exponential pi = 5 

4.83(0.08) 

4.87(0.17) 

6.98(0.23) 

2.87(0.49) 

2.66(0.50) 

0 = 100 

98.79(8.80) 

95.79(6.80) 

100.70(14.57) 

103.25(6.68) 

105.95(7.71) 

0-2 = 1 

0.50(0.13) 

0.53(0.12) 

1.00(0.44) 

0.96(0.11) 

1.52(0.28) 

Time 

237 secs 

40 secs 

759 secs 

2103 secs 

257 secs 


Table 1: Gomparison of estimation for isotropic Gaussian process 


We now assess the performance of the non-stationary FGP. To simulate non-stationary 
surface data, we use the the formulation in |Pintore and Holmes (2004). We use a localized 
squared exponential covariance with C'(si,S 2 ) = 0 hsi,s 2 ea;p(—||si — S 2 |p/asi,s 2)5 where 
asi,s 2 = (a(si) +a(s- 2))/2 and = 2 a(si)^/^Q;(s 2 )^/^/(a(si) 0 ( 82 )) and a(si) = 

2p(si)^. To assign values to the local range parameter, we use a smooth surface from the 
function p(si,S 2 ) = (cos( 47 rsi/ 100 ) -|- 2 )ea;p(s 2 / 200 ) with si,S 2 G (0,100)^ (Figure |^a)). 
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(a) Locations of the simulated data 


(b) Two spectral densities at log^Q scale: 
squared exponential (dashed) decreases 
much faster than Mat&n (solid). 


Figure 1: Simulation for Stationary FGP 


which generates data with different range of correlation (Figure |^e)). The NS-FGP model 
converges to 3 dominating clusters, with distribution pattern highly resembles the one 
for the range parameter (Figure [^b,c,d)). The estimated range parameters correspond 
to different correlation strength (Figure [^f,g,h)). To test the prediction performance, we 
reran the model with a random 80% sample of the data and predict on the remaining 
set. The non-st at ionary FGP model outperforms both the stationary FGP and the GPP 
models in cross validation metrics of root-mean-square error (RMSE) and median absolute 
deviation (MAD). 



NS-FGP 

S-FGP 

GPP (64 knots) 

p 

2.78(0.05), 1.55(0.03) and 1.20(0.08) 

2.15 (0.15) 

3.56 (1.20) 

RMSE 

1.81 

4.96 

10.20 

MAD 

1.62 

3.75 

6.96 

Time 

300 secs 

247 secs 

291 secs 


Table 2: Gomparison of estimation for non-stationary Gaussian process 


4.2 Real Data Application 


We now apply the functional Gaussian process on a massive spatial-temporal dataset. 
The data are obtained from the North American Regional Glimate Ghange Assessment 
Program (NARGGAP). We use the surface air temperature in the North America region 
( Mearns et aLf 2011). The data are simulations from the Weather Research & Forecasting 
regional model (WRF) coupled with the Third Generation Goupled Global Glimate Model 
(GGGM3). We choose the daily average temperate in a 92-day period, from June 1st 
to August 31th in 2000. which has 1,343,752 data points. To evaluate the prediction 
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(b) Weight field 1 


(c) Weight field 2 


(d) Weight held 3 
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(e) Synthesized Data 


(f) Mean held 1 (g) Mean held 2 

Figure 2: Simulation for Non-Stationary FGP 


(h) Mean held 3 


performance, we randomly left out 20% of the data as the testing set. This leads to 
training on 1,075,000 data points, which cost about 3 GBs of computer memory for storage 
alone. Direct estimation of the hne scale matrix with size of 1,075,000 x 1,075,000 would cost 
3,225,000 GBs of memory, which is unrealistic for the modern computers. FGP provides 
a nice solution to this problem: since it uses the Fourier transform of the original matrix, 
it involves at most the same size of the data for each stationary component; often, it 
cost less due to the sparsity of the spectral density (with sparse ratio commonly < 20%). 
We would like to emphasize that, unlike other approaches that uses lower-resolution grid 
for dimension reduction, FGP directly models the full scale matrix and does not involve 
resolution loss. For such a massive and hne-scale system, it took only 27 hours to hnish 
each 30,000-step Markov-Ghain Monte Garlo (MGMG) run. 

We assume the observed temperature Z at space/time point s = {si,S 2 ,t} is 

Zsi,S2,t — Xsj^^S2,t T -^ 81 , 52,4 T ^Sl,S2,t 


where si,S 2 ,t represent the longitude, latitude and day, respectively; the term Xs^^s 2 ,t is 
a hxed linear term which contains the intercept, hrst and second order terms of si,S 2 ,t; 
the second term Zs^^s 2 ,t is assumed be a location varying term; the last term e{xs,ys,ts) 

represents the random error, which is assumed es^,s 2 ,t ^(0, i^si S 21 )- 

We applied this NS-FGP to model the spatial varying term Zsj^^s 2 ,t- We parameterize 
the non-stationary model, by assigning the component weight and component mean with 
the isotropic covariance function Cov{{si, S 2 ,t), {s[, S 2 ,t')} = (j)exp{— -- 

' ^^2 ) + cr '^{si,s 2 ,t)=(s[,s' 2 ,t')- The spectral density is the product of three Fourier trans¬ 


forms. We also tested multiplying an extra space-time interaction term exp{ 
|^ 2 -g 2 \ \t-t I ^ term, as suggested by 


Gressie and Huang (1999 


Cl 

Nevertheless, 
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we found that the posterior estimates of Ci and C 2 for this data are quite large (> 10®), so 
that the interaction term is negligible. Therefore, we restrict the following analyses on the 
non-interactive isotropic model. 

We ran MCMC sampling for 30,000 steps and use the last 20,000 steps with 10-step 
thinning as the posterior sample. To approximate the inhnite component assumption, 
we started with 16 components. The NS-FGP quickly converges to 3 major components. 
The results for parameter estimation are listed in Table We use the truncation rule 
oi g > 0.01(7^ in likelihood evaluation, which leads a dimension reduction in the spectral 
density values from 1,075,000 to only ~53,000 (~ 5% truncation rate). We repeated the 
sampling for 3 times using different random numbers as the starting values, and found 
the model converges to similar configurations and close parameter estimates (Figure |^a)). 
And the trace also suggests the model has good convergence (Figure |^b)). 



T-!-1-J-1—^ 

0 5000 10000 15000 20000 

step 



Lag 


(a) Trace plot of parameter pt for the (b) Autocorrelation plot of parameter 
1st component mean, collected from 3 pt for the 1st component mean 
independet runs 


Figure 3: The diagonistics show good mixing of the chain produced in the posterior sam¬ 
pling 



NS-FGP 

Stationary FGP 


Component 1 

Component 2 

Component 3 


Mean 

<f> 

24.80 (1.52) 

38.27 (5.95) 

9.89 (0.52) 

48.42 (5.62) 


Pi 

4.22 (0.05) 

1.67 (0.02) 

2.51 (0.04) 

5.34 (0.45) 


P2 

4.90 (0.03) 

2.19 (0.04) 

3.05 (0.05) 

5.75 (0.56) 


pt 

0.43 (0.02) 

0.09 (0.01) 

2.95 (0.12) 

0.54 (0.06) 

Weight 

<t> 

1.58 (1.21) 

57.45 (15.16) 

65.05 (15.44) 



pi 

20.13 (2.53) 

5.15 (0.11) 

4.13 (0.09) 



P2 

15.41 (1.20) 

6.76 (0.15) 

5.31 (0.12) 



Pt 

12.04 (1.25) 

0.73 (0.02) 

0.67 (0.02) 


Clustering 

C Proportion ( %) 

64% 

17% 

19% 


Prediction Performance 

RMSE 

1.13 



2.75 


MAD 

0.65 



1.26 


Table 3: The parameter estimates and cross-validation performance in non-stationary and 
stationary model. 


We plot the data without the estimated trend l3'^Xs^,s2,t 


( Figure (a)), the mean 
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estimate (Figure (b)) and the weight estimate (Figure (c)) for each major component 
from June 1st,2000 to June 3rd, 2000 . The plots are organized by columns and each 
column represents one day. The NS-FGP captures different strengths of correlation in the 
mean estimates. Compared vertically, the three components seem to correspond to the 
oceanic, coastal and the localized whether patterns in North America. Thanks to the large 
scale 0 for the latent weight process L, we see the weight p close to 0 or 1 in most locations. 
This suggests a quite stable clustering, conditioning on which we can claim joint normality 
in the whole region. Since we used daily temperature data, we observed large variation 
of temperature pattern from day to day (compared horizontally in Figure]^ (a)). In the 
model estimates, we indeed saw low temporal correlation (Table and dynamic changes 
in the distribution of the mean and the weight (Figure (b,c)). 

Lastly we tested the prediction performance of the NS-FGP. Since we do not know the 
cluster assignment for the predicted locations, we hrst use the Cs^ = arg ma.XkPsj,k as the 
estimator for Cs ■ As mentioned above, we have max^ps ~ 1 in almost all the locations. 
In the prediction metrics, as shown in Table NS-FGP produced quite accurate prediction. 
As a comparison, we also ran the stationary FGP model on the data. The stationary FGP 
seemed to overly smooth the data, therefore is less accurate than the non-st at ionary model. 


5 Discussion and Future Work 


Our proposed method provides a new construction of Gaussian process that directly con¬ 
nects the spectral properties to its applications, such as parameter estimation and predic¬ 
tion. There are several extensions worth researching in the future. First, since space-time 
interaction is easier to obtain via spectral convolution than covariance function construc¬ 
tion, it is interesting to relax the form of the spectral density, regardless of whether the 
closed form of covariance function exists. Second, in the non-stationary method, we provide 
a general mixture framework with spectral dependency and we use probit stick-breaking 
process for illustration. Some other clustering approaches such as Pitman-Yor process (Ish- 


waran and James, 2001) may be studied to have more components. Third, the spectral 


properties in multivariate analysis can be studied, with a spatial correlation across differ¬ 
ent dependent variables. Fourth, more theoretic studies can be pursued, such as objective 
Bayesian priors and the posterior consistency. 
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(a) De-trended Data 


June 1,2000: Component 1 



June 2,2000: Component 1 


I 

I 



June 3, 2000: Component 1 




0 

-5 


I 


-10 

-15 


June 1,2000: Component 2 


June 2, 2000: Component 2 


June 3, 2000: Component 2 



June 1,2000: Component 3 June 2,2000: Component 3 June 3, 2000: Component 3 



1 

i:: 

. ' 

1 

i;: 


1 

V«* 

> * 

1 

■-5 

-0 

-5 

■--15 

1 

■5 

-0 

-5 

■--15 

' 

1 


(b) Component mean 
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(c) Component weight 

Figure 4: The surface temperature data in three days and the three stationary components 
estimated by NS-FGP 
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SUPPLEMENTARY MATERIAL 


5.1 Proof of Theorems 

5.1.1 Proof of theorem 1 

We first prove the covariance ([^ is real. Because of the symmetric function in g{{ui, ...,ujk, 
—cuk, for any k = 1,2, and symmetric distribution of {tj/} about 0 , 

we have 

n n 

exp{iu;J{sj -Sk)}g{u;i) = '^{cos{u;J{sj -Sk))giu;i) +ism{u;J{sj - Sk))giu;i)} 

1=1 i=i 

[n/2\ 

= 2 ^ {cos{u}J{sj - Sk))g{uji)} + g{0) 

i=i 

where n is assumed to be an odd number; if n is even, we remove ^'(O) on the right hand 
side. In either case, the imaginary parts are canceled. 

To prove the positive dehniteness, we use the matrix representation S = QGQ* + Icx^. 
For any nontrivial real vector X of size iV, we have: 

X'HX = X'QGQ*X + X'Xa"^ 

It is trivial that X'Xa"^ > 0 for > 0 . Now we denote Q*X = Y = U + 
where Yi and are the real and the imaginary parts of the transform of X. We have 
X'QGQ*X = Yi'GYi + Ya'GYa > 0, as each element of G satishes g{.) > 0. Combining 
two parts, we prove that for any nontrivial X, X'HX > 0, which is the dehnition of 
positive dehniteness. 


5.1.2 Proof of theorem 2 

As the covariance function without the nugget cr^ can be viewed as 


G(x) = [ expii^^uj) [ exp{-i^^uj)C{^)d^duj = [ exp{i^^uj)g{uj)duj 

JRd jRd J^d 


Denote the specihed covariance function as Gof(x) = 


W = ..., ^Ai} X ... X {-^A 

I- rii m ' ni 


ni 


ni 


rtd 


rid 


Cov{sj,Sk) and the subregion 
Arf,..., — Ad} then we have: 

a, , aj 


||Gon(x)-G(x)|| < 


' wew 


exp{ix^(jj)g{(jj)d(jj — E exp{ix^u;i}g{u;i)/n\ 


+ 


cos (x'^LJ) S' (^) dijJ 


1=1 


Since || cos(x^Lj)5f(Lj)|| < e, we let e = 1 /rn? and use the dominated convergence 
theorem lim^^oocos(x'^a;)5f(a;)da; = hm,^^oo cos(x^a;)5((a;)da; = 0 . And the 
hrst part corresponds to the error of the middle Riemann sum: 
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lim 

m^oo 


fu;eW 

where K is a finite constant. 


exp{i:>c^u})g{u})du} — exp{i:x.^u}i}g{uJi)/n\\ < K/w? 


i=i 


5.1.3 Proof of theorem 3 

We first prove the exchangeable condition. For any permutation of location vector 8,^ = 
{s^i, s^ 2 ; •••; s^rn} and the random variables Zs^, we define the permutation matrix such 
that = P^S. Then we have Zg, = P^rZs, Qs, = PttQs and P^P'^ = P^P^ = I. 

Since (Pj^SP'^)”^ = P^rS'^P'^ and |P^SP^| = IP'^P^^SI = |S| for any positive definite 
S, which we showed for Zg^ in theorem 1. 

p(s^i, s^2,s.n) =(27r)-’^/2|p^5.p^|-i/2g^p^_2/^p/^(p^5.p/j-ip^2g/2) 

= (27r)-’^/2|S|”i/2ea;p(-ZsS-^Zg/2) 

=p(Si, S2, ..., S„) 

Next, for a finite location set So = {si, S 2 ,..., s„} and any location G we have the 
joint distribution: 




QsGQg + IcT^ QgGQgj^ 
Qs,GQ^ Q,,GQ:^ + Ia2 


Using normal theory, it is straightforward to verify that: 


[ p(S„,sO<ist = (27r)-”/"|Qs„GQJ„+Iffy'/2gip(-Z;,„(Qs,GQ|„+Iff2)-iZs„/2) =p(S„) 
5.1.4 Proof of theorem 4 

The jth row of Qg, Qg^, is an (nin 2 ...nrf)-element vector, and j = ji + ... + jd with 
jk G {0, ■■■,rnk — 1} and mt < Uk- This row is composed of the {nin 2 ...nd) elements of the 
tensor product ® qj^® ... ® qj^, where represents a vector in the kth sub-dimension: 

qj^ = {exp[-i{-—7^)jk]/^/^,exp[-i{-— — -7i)jk]/...,exp[-i{—7^)jk]/^/^} 

Tlk Tlk Tlk 

which is the jkth Fourier basis, which has the orthogonality. That is, given another row 
/, if jk = Ik then q'j^ql^ = 1, else q'j^qi^ = 0. Using the rule of tensor product, we have 
QsjQsi = 1 only if j = I, else 0 . 

When nik = Uk, we have Qg = Qg and QsQs = QsQs = (Qs)*(Qs)- Using the 
similar proof as above, except for changing —i to i, we have QsQs = I- 
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5.1.5 Proof of theorem 5 

We first show the positive dehniteness of the non-stationary covariance function Cov{Zs ^, Zs^ |Csj, Csk) 
+ cr‘^lj=k- For simplicity of notation, we abbreviate Qsj as Qj and G^^^ 

J J 

1 /2 

as Gj . Then we have the follow matrix decomposition: 


Q2G2/'G}/'Qt 


QiG}/'Gy'Q* 

pil/2^1/2^=^ 


QnGy'G;/'Qt 


Qi 0' 

0' Q2 


0 ' 

0 ' 


0' 0' ... Q, 

ABB'A* + la^ 


- 

gP' 




-1 

: 

■ 0 




1/2 ^1/2 


QnGy^Gy'Q; + 


G\'^ G 


G 


1/2 


QI 0 

0 q; 


0 

0 


0 0 q; 


+ 


0 

0 

0 0 


0 

0 


where Q(,) is 1-by-m matrix, 0 is n-by-1 zero matrix and A and B represent two corre¬ 
sponding block matrices. For any nontrivial vector X, we have X'HiX = X^ABB'A*X -|- 
X'Xa^ = y/l^i + y2'y2 + X'Xa^ > O, where B'A*X = Yi + /y2. 

The two Kolmogorov consistency conditions are satished since the joint normal density 
is dehned for every location set. The proof is similar to the proof of theorem 3. 
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